The aim of this study is to propose a new robust face recognition algorithm
that combines principal component analysis (PCA), a triplet similarity
embedding technique, and projection as a similarity metric at different
stages of the recognition process. The main idea is to use PCA for
feature extraction and dimensionality reduction, then train the triplet similarity
embedding to accommodate changes in facial pose, and finally use
orthogonal projection as a similarity metric for classification. We use the open
source ORL dataset to measure the recognition rates of the proposed algorithm
and compare them to the performance of a well-known machine learning
algorithm, the k-Nearest Neighbor (KNN) classifier.
Our experimental results show that the proposed model outperforms KNN.
Moreover, when the training set is smaller than the test set, the performance
contribution of triplet similarity embedding during the learning phase
becomes more visible compared to the same algorithm without it.

1. INTRODUCTION

Pattern recognition and computer vision are fast-developing fields of computer science. Face
recognition is the principal biometric method and, compared to other methods such as fingerprint and iris
recognition, it does not require cooperation with the device. Face recognition finds applications in forensic
investigations, identity verification at security points, healthcare, and game apps. Despite the promising
progress in the field, there are still many challenges due to the dynamic nature of the task, including illumination,
pose variations, occlusion, aging, and facial expressions [1]. Various techniques and algorithms have been proposed
to better identify and recognize facial images [2], and we briefly review some of the well-studied classical
approaches for still-image recognition. From a purely mathematical perspective, one can call a classifier linear
or non-linear based on whether the techniques underpinning the algorithm make use of linear algebra or
(differential) calculus, respectively. We start by recalling some linear techniques.

One of the classical approaches is Eigenfaces [3], [4], which is based on principal component analysis
(PCA). PCA is by far the most well-known unsupervised dimension reduction and feature extraction
method, used in many disciplines. The main idea is to find an approximate basis consisting of eigenvectors of
a covariance matrix computed from the images in the dataset. For a detailed algorithm, we refer to the next section.

Another closely related technique is Fisherfaces, based on linear discriminant analysis (LDA) [5], [6].
This is a class-specific approach where one tries to better arrange images before PCA is applied. The main
distinction between PCA and LDA is that the latter tries to better understand the distinctions between classes
of faces. The k-Nearest Neighbor (KNN) classifier is a common supervised machine learning method used in
classification problems. Its simplicity, speed, and high recognition rate make the KNN algorithm a popular
classifier [7]. For a given positive integer k and the space of labeled images, KNN finds the k nearest neighbors
of an image to be classified according to a distance function. The image is then labeled according to the
majority vote of these k nearest neighbors.
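
The majority-vote rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `knn_predict`, the row-per-image layout, the Euclidean distance, and the tie-breaking rule are assumptions made here, not details taken from [7].

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=1):
    """Label `query` by majority vote among its k nearest training vectors.

    train_x: (n, m) array with one flattened image per row (assumed layout).
    train_y: (n,) array of integer labels.
    """
    # Euclidean distance from the query to every training vector (assumed metric).
    dists = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances; a stable sort keeps ties deterministic.
    nearest = np.argsort(dists, kind="stable")[:k]
    # Majority vote; np.unique sorts labels, so a tied vote goes to the smallest label.
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

With k = 1 the rule reduces to copying the label of the single closest training image.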

All three classifiers mentioned above are examples of linear models. The last linear model we review
here is the sparse representation based classifier (SRC) [8], [9]. In this model, a new face image is written as a
linear combination of labeled images in the training set with $\ell_1$-norm minimization, and it is labeled according
to the class that receives the maximum weight. While linear models achieve high recognition rates on various
available databases, the state of the art in face recognition techniques for large datasets is the family of
non-linear models based on deep learning methodology [10]-[12]. More specifically, convolutional neural network
(CNN) based approaches to face detection and recognition have many real-life applications. In simple terms,
the main idea of deep learning is to train a nonlinear face recognition function using the stochastic gradient
descent (SGD) algorithm.

Various works consider improvements of the above-mentioned algorithms and compare their
recognition rates to others using certain databases of faces. The performance of PCA with a nearest distance
criterion was compared against LDA with a Euclidean distance criterion in [13] using the AT&T (ORL) database
[14], and LDA showed better performance with a maximum recognition rate of 95% when 90% of the database
images were allocated to the training set. Another work [15] also uses the ORL database to compare the
accuracy of the original SRC to the proposed discriminative common vector dictionary-based SRC (DCV-SRC).
The maximum accuracy of 96.54±1.69 is reached with DCV-SRC when 70% of the ORL dataset is assigned
to the training set, while SRC showed an accuracy of 95.58±2.24. In a recent work [16], a version of the KNN
classifier based on kernel discriminant analysis together with a support vector machine was proposed, and it
was reported to have 96% recognition accuracy when the ORL database is used with 60% assigned to the
training set. Lastly, we note that when 80% of the ORL faces were allocated to the training set, a CNN
was reported to have a recognition rate of 95% [17].

Our goal in this article is to propose an improvement to Eigenfaces and compare the novel algorithm
to the above-mentioned algorithms using the ORL dataset of facial images. In the next section we introduce
the ORL dataset and the proposed algorithm in detail. Section 3 contains the results of the experiment. We end
with a discussion and suggestions for future work.


2. METHOD 

In this section, we provide information on the ORL dataset we study and the proposed PCA-TP
classifier, which merges PCA and triplet similarity embedding with a projection-based recognition
metric. In the first part we use principal component analysis for dimensionality reduction, followed by triplet
similarity embedding. The final part makes use of the orthogonal projection of a vector onto a subspace to predict
the label.


2.1. ORL dataset 

In this work, we use the ORL face database [18], which contains 400 images of 40 individuals (10 images per
person) taken under different conditions such as lighting, facial expression, and time of capture, as shown in Figure 1 for a few samples
and in Figure 2 for one individual. Images are grayscale and have a size of 112 × 92 pixels. All images are split into training and
test sets in a stratified manner, which means both sets contain the same proportion of images from each subject.
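
A stratified split of this kind can be sketched as follows. The helper name `stratified_split`, the seed, and the per-subject shuffling are illustrative assumptions; the text does not specify how the random split was implemented.

```python
import numpy as np

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx) so every subject contributes the same share."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for subject in np.unique(labels):
        # Shuffle the image indices of this subject only.
        idx = rng.permutation(np.flatnonzero(labels == subject))
        n_train = int(round(train_fraction * len(idx)))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# ORL layout: 40 subjects with 10 images each (labels 0..39, assumed ordering).
labels = np.repeat(np.arange(40), 10)
train_idx, test_idx = stratified_split(labels, train_fraction=0.2)
```

For a 20/80 split this puts 2 images of every subject in the training set and the remaining 8 in the test set.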

Figure 1. Samples from ORL dataset

Figure 2. 10 different images of one individual from ORL dataset


2.2. Proposed face classifier PCA-TP

We let Train denote the labeled set $\{x_1, x_2, \ldots, x_n\}$ of $m$-dimensional vectors that will be used to train
the classifier, and Test denote the labeled set $\{y_1, y_2, \ldots, y_N\}$ of $m$-dimensional vectors that will be used to test the
accuracy of the classifier. We now introduce our proposed classifier PCA-TP. It has three essential parts,
namely, dimension reduction with PCA, training of the triplet similarity embedding, and using projection
as a similarity metric. We now explain each part in detail.

One of the well-known dimension reduction techniques is principal component analysis (PCA)
[19]. In computer vision, it is used to represent an image with a feature vector of relatively small dimension, as
we now describe. Let $A = [x_1 \; x_2 \; \cdots \; x_n]$ be the $m \times n$ matrix with the $m$-dimensional vectors from Train stacked into
columns. We let $\mu$ be the row average of $A$, that is, the $m$-dimensional column vector

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

and set $X_0 = A - \mu$, where the vector $\mu$ is subtracted from each column of $A$. Next, we consider the covariance matrix
$\frac{1}{n} X_0 X_0^T$, which is readily seen to be symmetric. Hence, its eigenvalues are all real and we may list them in
decreasing order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Let $\{e_1, e_2, \ldots, e_n\}$ be the corresponding eigenvectors. To reduce the
dimension $m$ to $d$, we form a $d \times m$ matrix $X_{pca}$ whose rows are the first $d$ eigenvectors $\{e_1, e_2, \ldots, e_d\}$. For our
purposes, we simply take $d$ to be the number of nonzero eigenvalues counting multiplicity. As the
nonzero eigenvalues of $X_0 X_0^T$ and $X_0^T X_0$ are identical, we see that $d$ is at most $\min(m, n)$. Once $X_{pca}$ is computed,
any $m$-dimensional vector $x$ representing an image can be transformed into a $d$-dimensional feature vector $x'$
via $x' = X_{pca}(x - \mu)$.
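
The PCA step can be sketched in NumPy as below. It uses the observation from the text that $X_0 X_0^T$ and $X_0^T X_0$ share their nonzero eigenvalues, so only the small $n \times n$ matrix is diagonalized. The function names `fit_pca` and `pca_transform`, the $1/n$ scaling, and the relative tolerance used to decide which eigenvalues count as nonzero are assumptions of this sketch.

```python
import numpy as np

def fit_pca(A, tol=1e-10):
    """A: (m, n) matrix with one training image per column.

    Returns (mu, X_pca) where mu has shape (m, 1) and X_pca has shape (d, m).
    """
    n = A.shape[1]
    mu = A.mean(axis=1, keepdims=True)      # mean image
    X0 = A - mu                             # centered columns
    # Small n x n matrix with the same nonzero eigenvalues as X0 X0^T / n.
    gram = (X0.T @ X0) / n
    eigvals, V = np.linalg.eigh(gram)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to decreasing
    eigvals, V = eigvals[order], V[:, order]
    keep = eigvals > tol * eigvals[0]       # numerically nonzero eigenvalues
    # If v is an eigenvector of X0^T X0, then X0 v is an eigenvector of X0 X0^T.
    E = X0 @ V[:, keep]
    E /= np.linalg.norm(E, axis=0, keepdims=True)
    return mu, E.T

def pca_transform(X, mu, X_pca):
    """Map the columns of X, shape (m, k), to d-dimensional feature vectors."""
    return X_pca @ (X - mu)
```

For the ORL images, where m = 10304 is far larger than the number of training images, working with the $n \times n$ matrix avoids forming the $m \times m$ covariance matrix explicitly.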


2.3. Triplet similarity embedding 

Triplet similarity embedding is a deep learning method [20] used in computer vision to improve face
recognition, verification and clustering. The idea is to train an embedding $f$ on the space of feature vectors so
that for any anchor $a$ and positive $p$ with the same label, and a negative $n$ with a different label, it
holds that $\mathrm{dist}(f(a), f(p)) < \mathrm{dist}(f(a), f(n))$ for a given metric $\mathrm{dist}(\cdot,\cdot)$. To avoid trivial solutions, a positive
hyperparameter $\alpha > 0$ is included, and the training set is then used to minimize (1):

$$\max\left(0,\; \alpha + \mathrm{dist}(f(a), f(p)) - \mathrm{dist}(f(a), f(n))\right) \quad \text{for all triplets } \{a, p, n\} \subset \mathrm{Train} \tag{1}$$
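
As a small illustration of (1), the loss for a single triplet of already-embedded vectors can be written as follows. The Euclidean distance and the function name `triplet_hinge` are assumptions of this sketch, since (1) leaves the metric generic.

```python
import numpy as np

def triplet_hinge(fa, fp, fn, alpha):
    """Loss (1) for one triplet of embedded vectors f(a), f(p), f(n)."""
    d_pos = np.linalg.norm(fa - fp)   # anchor to positive (same label)
    d_neg = np.linalg.norm(fa - fn)   # anchor to negative (different label)
    # Zero once the negative is farther than the positive by at least alpha.
    return max(0.0, alpha + d_pos - d_neg)
```

The loss vanishes for triplets that already satisfy the margin, so only violating triplets contribute to training.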


In this article, a linear transformation $f$ is trained, and the projection (dot product) based approach adopted
from [21] is used. The formula $\cos(\theta) = \langle u, v \rangle$, where $\langle \cdot, \cdot \rangle$ is the dot product, gives the cosine of the
angle $\theta$ between two unit vectors $u, v$. Therefore, if $u, v$ are feature vectors representing the same face, they
are expected to be close to one another, making $\theta$ close to zero and thus $\cos(\theta)$ large. Geometrically, if $v$ has
unit length, the dot product is, up to sign, the Euclidean norm of the projection of $u$ onto the span
of $v$. Closer vectors make the dot product larger in the projection setting, so the inequality of the distance-based
loss is reversed and one needs to train the similarity transformation $f$ so that (2) is minimal.


$$\max\left(0,\; \alpha + \langle f(a), f(n) \rangle - \langle f(a), f(p) \rangle\right) \tag{2}$$
For a linear transformation $f$, we consider $f(x) = Wx$ for some $r \times d$ matrix $W$, where $d$ is the
dimension of the feature vectors obtained from the PCA dimension reduction and the integer $r \le d$ is a
hyperparameter. Hence, $W$ further reduces the dimension of the feature vectors from $d$ to $r$. Note that
$\langle f(x), f(y) \rangle = f(x)^T f(y)$. Hence, the minimization problem has the final form (3):


$$\operatorname{argmin}_W \sum_{\{a, p, n\}} \max\left(0,\; \alpha + a^T W^T W n - a^T W^T W p\right), \tag{3}$$
where $\{a, p, n\}$ runs over triplets from the training set of feature vectors. To optimize the matrix $W$, stochastic gradient
descent (SGD) is used together with $\ell_2$-regularization to avoid overfitting. As the gradient of an individual loss
satisfies (4):

$$\nabla_W \left( a^T W^T W n - a^T W^T W p \right) = W (a n^T + n a^T) - W (a p^T + p a^T), \tag{4}$$

in each iteration of SGD the matrix $W$ gets updated as in (5):

$$W_{t+1} = W_t - \beta W_t \left( a (n - p)^T + (n - p) a^T \right) - \gamma W_t, \tag{5}$$

where $\gamma W_t$ is the $\ell_2$-regularization term. For the experiment, we iterate SGD 1,000,000 times with the hyperparameters
$\alpha = \beta = 0.001$, $\gamma = 0.00001$ and initialize the matrix $W$ as the $d \times d$ identity matrix.
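
The training of $W$ can be sketched as follows. The gradient function implements (4) and the update implements (5). Applying the update only when the hinge is active follows Algorithm 1. The function names, the uniform random triplet sampling, the seed, and the small default number of steps are assumptions of this sketch, not the experimental configuration.

```python
import numpy as np

def triplet_loss(W, a, p, n, alpha):
    """Hinge loss for one triplet with f(x) = W x, as in (3)."""
    fa, fp, fn = W @ a, W @ p, W @ n
    return max(0.0, alpha + fa @ fn - fa @ fp)

def triplet_grad(W, a, p, n):
    """Gradient (4) of a^T W^T W n - a^T W^T W p with respect to W."""
    diff = n - p
    return W @ (np.outer(a, diff) + np.outer(diff, a))

def sgd_step(W, a, p, n, alpha, beta, gamma):
    """One update (5), applied only when the hinge is active."""
    if triplet_loss(W, a, p, n, alpha) > 0:
        W = W - beta * triplet_grad(W, a, p, n) - gamma * W
    return W

def train_embedding(X, labels, alpha=1e-3, beta=1e-3, gamma=1e-5,
                    steps=1000, seed=0):
    """X: (d, N) PCA feature columns. Returns a (d, d) matrix W."""
    rng = np.random.default_rng(seed)
    d, N = X.shape
    W = np.eye(d)                           # identity initialization
    for _ in range(steps):
        i = rng.integers(N)                 # random anchor (assumed sampling)
        same = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
        other = np.flatnonzero(labels != labels[i])
        if same.size == 0 or other.size == 0:
            continue                        # no valid triplet for this anchor
        a, p, n = X[:, i], X[:, rng.choice(same)], X[:, rng.choice(other)]
        W = sgd_step(W, a, p, n, alpha, beta, gamma)
    return W
```

Since the inner expression is quadratic in $W$, the gradient formula can be checked against central finite differences, which is a quick way to validate an implementation before running a long training loop.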


2.4. Projection as a similarity metric 

In computer vision, various metrics are used to test whether two vectors belong to the same label,
see e.g. [22], in which 15 different distance functions are used. We simply use the Euclidean norm of the projection
onto a subspace. The $m \times m$ orthogonal projection matrix onto the subspace $V$ is defined as
$P = A (A^T A)^{-1} A^T$ (see [23]), where $A$ is an $m \times n$ matrix whose columns are $n$ basis vectors of $V$. It is
readily seen that $P$ satisfies $P^T = P$ and $P^2 = P$. In our setting, $V$ is the subspace spanned by the different images of
a single label. A test feature vector $y$ is assumed to belong to a label if the Euclidean norm of $Py$
for that label's subspace is higher than the norms of the projections onto the subspaces of the other labels. The squared Euclidean norm of
$Py$ satisfies (6):


$$\|Py\|_2^2 = \langle Py, Py \rangle = y^T P^T P y = y^T P^2 y = y^T P y \tag{6}$$


Therefore, this formula can be used to compare the projection norms for each labeled class. The label of a
test feature vector $y$ is then given by (7):


$$y_{\mathrm{label}} = \operatorname{argmax}_i \; y^T P_i y \tag{7}$$


where $P_i$ is the matrix that projects onto the subspace spanned by the feature vectors from the training
set that belong to the label $i$. We point out that the columns of $A$ have to be linearly independent; in
particular, the training set images of a label must not be redundant. The whole procedure is summarized in
Algorithm 1. We denote by PCA-P the same algorithm with the triplet similarity embedding
omitted, that is, SGD is not run.
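
The projection-based labeling in (6) and (7) can be sketched as below. The function names and the dictionary of per-label projectors are illustrative assumptions. The projector is formed directly from $P = A(A^T A)^{-1}A^T$, so the sketch relies on the columns of each class matrix being linearly independent, as the text requires.

```python
import numpy as np

def class_projectors(X, labels):
    """X: (d, N) training feature columns. Returns {label: projection matrix}."""
    projectors = {}
    for label in np.unique(labels):
        A = X[:, labels == label]           # basis vectors of this label
        # P = A (A^T A)^{-1} A^T, computed with a linear solve instead of an
        # explicit inverse; fails if the columns of A are linearly dependent.
        projectors[label] = A @ np.linalg.solve(A.T @ A, A.T)
    return projectors

def predict_label(y, projectors):
    """Rule (7): pick the label whose subspace keeps the most of y."""
    scores = {label: y @ P @ y for label, P in projectors.items()}
    return max(scores, key=scores.get)
```

Because $y^T P y = \|Py\|_2^2$ by (6), no explicit norm computation is needed when comparing labels.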


Algorithm 1. Proposed PCA-TP algorithm
Input: $m \times n$ matrix $A = [A_1 \; A_2 \; \cdots \; A_n]$ for $n$ labels, $y \in \mathbb{R}^m$, $r, t \in \mathbb{N}$, $\alpha, \beta, \gamma > 0$, $w \in [0,1]$
Output: label for $y$
Columns of $A$ and $y$ are normalized w.r.t. the $\ell_2$-norm
PCA is implemented to reduce $A$ to a $d \times n$ matrix $A'$
$y$ is transformed into $y'$ in $\mathbb{R}^d$ with the PCA transformation
Set $W$ as the $r \times d$ identity matrix
Train a triplet similarity embedding with SGD for the $r \times d$ matrix $W$ according to (3):
for $i = 1$ to $t$ do
    take random $\{a, p, n\}$ from the columns of $A'$ with $\{a, p\}$ having the same label and $n$ a different label
    if $\alpha + a^T W^T W n - a^T W^T W p > 0$ do
        $W := W - \beta W \left( a (n - p)^T + (n - p) a^T \right) - \gamma W$
    end if
end for
Compute $A'' = W A'$ and $y'' = W y'$
for $i = 1$ to $n$ do
    Calculate the similarity metric $R_i$ according to (6)
end for
Assign label $= \operatorname{argmax}_i \{R_i\}$


3. RESULTS AND DISCUSSION 

This work compares the performance of KNN to the proposed algorithm PCA-TP using the ORL
dataset. Four different scenarios were considered where the dataset is randomly split into training and test sets
with ratios 80/20, 60/40, 40/60, and 20/80. We compare the accuracy of our method to that of the k-Nearest
Neighbor classifier (KNN). We only report KNN accuracy for k = 1, as greater values of k
underperform. The results of our experiments are summarized in Table 1.

Table 1 shows that both PCA-TP and PCA-P have better accuracy than KNN for almost all train/test
splits (for detailed information on KNN we refer to [24] and [25]). PCA-TP is at least as accurate as the
other algorithms for every split, and strictly better for the 40/60 and 20/80 splits. We note that for the larger training
sets (80/20 and 60/40), the similarity embedding does not improve the classifier: PCA-TP and PCA-P reach the same accuracy. On the other hand,
when the training set contains as little as 20% of the dataset, triplet similarity embedding supplies roughly 3 additional
percentage points of accuracy (84.375% versus 81.56%).
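
As a sanity check on the scale of these differences, the reported percentages can be converted into counts of correctly labeled test images. The sketch below assumes that the test set sizes are exactly 80, 160, 240 and 320 images out of 400 and that each reported percentage is a rounded ratio of integers; the accuracies are copied from Table 1 and the helper name `correct_count` is illustrative.

```python
TOTAL_IMAGES = 400  # ORL: 40 subjects x 10 images

# test share (%) -> reported accuracies in percent, copied from Table 1
reported = {
    20: {"PCA-TP": 98.75, "PCA-P": 98.75, "KNN": 95.0},
    40: {"PCA-TP": 97.5, "PCA-P": 97.5, "KNN": 95.0},
    60: {"PCA-TP": 93.33, "PCA-P": 92.5, "KNN": 92.5},
    80: {"PCA-TP": 84.375, "PCA-P": 81.56, "KNN": 81.88},
}

def correct_count(accuracy_percent, test_percent, total=TOTAL_IMAGES):
    """Return (correctly labeled images, test set size) under the stated assumptions."""
    n_test = total * test_percent // 100
    return round(accuracy_percent / 100 * n_test), n_test

for test_percent, row in reported.items():
    for name, acc in row.items():
        hits, n_test = correct_count(acc, test_percent)
        print(f"test {test_percent}%: {name} {hits}/{n_test}")
```

Under these assumptions, the gap between PCA-TP and PCA-P on the 20/80 split corresponds to 270 versus 261 correct labels out of 320 test images.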

The graphs of the loss function from the training of the similarity embedding transformation $f$
with SGD, for each train/test ratio from high to low, are given in Figure 3. The graphs
show more distortion for low train/test ratios and, as seen in Table 1, this distortion
influences the accuracy of the algorithm. Table 2 provides some of the commonly mislabeled images and their
false positives. Here the false matches are understandable because the paired faces are very similar to each other.

Figure 3. Training triplet similarity embedding matrix with SGD for various splitting of the dataset (from
80/20 to 20/80 train/test ratio)


Table 1. Accuracy of classifiers
Train/test split (%)   PCA-TP (%)   PCA-P (%)   KNN (k=1) (%)
80/20                  98.75        98.75       95
60/40                  97.5         97.5        95
40/60                  93.33        92.5        92.5
20/80                  84.375       81.56       81.88


Table 2. The commonly mislabeled images and their false positives 
False positives Actual faces False positives Actual faces 

4. CONCLUSION

In this work, we proposed a new robust algorithm for face recognition. It combines
the well-known principal component analysis for dimensionality reduction, triplet similarity
embedding, one of the effective deep learning methods, for the learning process,
and projection as a similarity metric for classification. The proposed algorithm was applied to
the ORL dataset for different train/test portions, and the results were compared to those of PCA-P, that is, the same algorithm without
triplet similarity embedding, to check its effect, and then to one of the well-known machine learning algorithms, KNN.
The comparisons showed that PCA-TP, the combination of the three methods, achieves
higher performance. It matches or outperforms the others for almost all conditions.
Unfortunately, the recognition rate deteriorates sharply for low train/test ratios. As future
work, one can develop an algorithm to compensate for this rapid decrease in accuracy for smaller training sets.